1 Introduction

The objective of this project was to identify clinical phenotypes in patients with COVID-19, assess their interaction with the response to specific supportive management strategies, and evaluate their association with outcomes.

In this document, we will successively describe the different steps that were undertook :

2 Data

The CODA19 database was used for the project. Data for the following variables were available for analysis :

2.0.1 SQL querying and Data Wrangling

# This chunk of code queries the CODA19 database and runs SQL scripts extracting the aforementionned data
coda19 <- DBI::dbConnect(RSQLite::SQLite(), "../py_eyamga/covidb_version-1.0.0.db")
files_path <- list.files(path = "./sql", full.names=TRUE)
files_names <- as_tibble(str_split_fixed(list.files(path = "./sql"), pattern =".sql", n=2))[[1]]
dflist = list()
for (i in seq_along(files_path)){
  #dbGetQuery(coda19, statement = read_file(files_path[i]))
  tmp <-  dbGetQuery(coda19, statement = read_file(files_path[i]))
  assign(files_names[i], tmp)
  dflist[[i]] = tmp
  write.csv(x = dflist[i], file = paste0(files_names[i], ".csv"))
}

The output of this step was all the aforementioned variables for all COVID 19 hospitalizations. A COVID 19 hospitalization was defined as any hospitalization episode of patient with a positive COVID-19 PCR test within 73 days of the admission. To reduce the number of features, drugs were mapped to MESH classes using the RxClass API and ICD10 codes were mapped to Charlson’s comorbidities.

3 different datasets were obtained at different timestamp from the admission : 24h, 48h, and 72h.

2.0.2 Data description

Here is a preview of the raw dataset at 24 hours.

For more details, take a look a the complete descriptive statistics in the following pdf files.

2.0.3 Handling missing data

We removed all variables for which more than 35% of the data was missing, and all observations for which more than 35% of the variables were missing. We imputed the remaining data using simple CART imputation a methodology that is robust against outliers, multicollinearity and skewed distributions. Imputation was conducted using all variables except the death and the patient identifiers.

# Getting predictorMatrix and modifying it to exclude 2 variables from the imputation set 
matrix_72h <- mice::mice(covid_72h2, method = "cart", m=1, maxit = 0)
pred_matrix72h <- matrix_72h$predictorMatrix
pred_matrix72h[,'patient_site_uid'] <- 0
pred_matrix72h[,'death'] <- 0

# Imputing
covidimputer <- mice::mice(covid_72h2, pred = pred_matrix72h, method = "cart", m=1)
covid72h_imputed <- complete(covidimputer, 1)

The description of the imputed datasets was also provided as pdf files. Briefly :

  • 24h dataset : n=358, 135 variables
  • 48h dataset : n=380, 144 variables
  • 72h dataset : n=400, 144 variables

3 Feature engineering

3.1 Initial PCA and MCA

This will be done using the 72h dataset.

To select the most important variables we conducted PCA on the continuous variables and MCA on the categorical variables.

3.1.1 PCA on all continuous variables

Because the dataset contained continuous variables in different format (min, max, mean), we used PCA to choose the subtype that mostly explained the variance in the data.

pca.coda19 <- PCA(coda19_continuous,
                  quali.sup = c(54, 55),
                  graph = FALSE)

fviz_pca_var(pca.coda19,
             col.var = "contrib", # Color by contributions to the PC alternatives, color by cos2
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE     # Avoid text overlapping
             )
## Warning: ggrepel: 41 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

The PCA variable plot is messy but still shows use the most important continuous variables in our dataset. To have a better appreciation of the results, below are the variables with most important cos2 on PC1 and PC2.

fviz_cos2(pca.coda19, choice = "var", axes = 1:2, top=20) # Seeing mots important cos2 variables on PC1 and PC2

However, when looking at the overall capacity of our principal components to explain the variation of our data, the 5 PCs only explain 44% of the variance.

head(pca.coda19$eig)
##        eigenvalue percentage of variance cumulative percentage of variance
## comp 1   7.512449              14.174432                          14.17443
## comp 2   4.643389               8.761111                          22.93554
## comp 3   4.076958               7.692373                          30.62792
## comp 4   3.950792               7.454324                          38.08224
## comp 5   3.290256               6.208029                          44.29027
## comp 6   3.112389               5.872432                          50.16270

3.1.2 MCA on all categorical variables

The similar exercise was conducted using MCA on the categorical variables

mca.coda19 <- MCA(coda19_categorical,
                  quali.sup = c(90), # Removing death from observation but MV kept
                  graph = FALSE)

fviz_mca_var(mca.coda19,
             col.var = "contrib", # Color by contributions to the PC alternatives, color by cos2
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE     # Avoid text overlapping
             )
## Warning: ggrepel: 187 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

The MCA variable plot seems to show a pattern for ICU admitted patients as sedation, muscular blocking agents and mechanical ventilation are all in the same area of the plot. The other medications do not seem to form a particular group.

Looking at the percentage of variance, the results are not particularly powerful because of the curse of dimensionality : in other words, over 90 medications group were included in this portion of the analysis but some were only rarely found in the dataset.

print(head(mca.coda19$eig))
##       eigenvalue percentage of variance cumulative percentage of variance
## dim 1 0.05085437               4.269848                          4.269848
## dim 2 0.04052505               3.402575                          7.672423
## dim 3 0.03257750               2.735281                         10.407703
## dim 4 0.02909950               2.443260                         12.850964
## dim 5 0.02667261               2.239493                         15.090456
## dim 6 0.02565848               2.154344                         17.244801

Once again, we identified the most contributive variables in order to determine which one would be kept for the second portion of the analysis.

N.B. Comorbidities ended not being candidate variables.

fviz_contrib(mca.coda19, choice = "var", axes = 1, top = 10)

3.2 PCA, MCA and FAMD on selected variables

As mentioned earlier, we selected the most important variables as it relates to the initial PCA and MCA analysis to respectively select the candidate continuous and categorical variables for our clustering effort.

We reconducted PCA and MCA with those selected variables only and showed an increase in the variance explained by the principal components validating our selection.

3.2.1 PCA with updated variables

coda19_continuous_labs <- coda19%>%select(c("patient_age", "hemoglobin_min", "plt_max", "wbc_max", "neutrophil_max", 
                                       "lymph_min", "mono_max", "eos_min", "sodium_min",
                                       "creatinine_max"))

coda19_continuous_vitals <- coda19%>%select(c("sbp_max", "sbp_min", "dbp_mean", "temp_max", "temp_min", "so2_min", "rr_mean"))


coda19_continuous2 <- coda19_continuous_labs%>%cbind(coda19_continuous_vitals)
coda19_continuous2$death <- coda19$death
coda19_continuous2$mv <- coda19$mv
coda19_continuous2$icu <- coda19$icu
# Doing PCA
pca.coda19.continuous <- PCA(coda19_continuous2,
                  quali.sup = c(18,19,20), #Removing ICU admission and Death from the PCA analysis
                  graph = FALSE)
fviz_pca_var(pca.coda19.continuous,
             col.var = "contrib", # Color by contributions to the PC alternatives, color by cos2
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE     # Avoid text overlapping
             )

kable(head(pca.coda19.continuous$eig))
eigenvalue percentage of variance cumulative percentage of variance
comp 1 2.099611 12.350656 12.35066
comp 2 2.009039 11.817877 24.16853
comp 3 1.649945 9.705557 33.87409
comp 4 1.586820 9.334238 43.20833
comp 5 1.231291 7.242885 50.45121
comp 6 1.084188 6.377577 56.82879

3.2.2 MCA on selected categorical data

# Selecting categorical data
coda19_categorical2 <- coda19_categorical%>%select("female", "male", "vasopressors", "glucocorticoids", "anti_bacterial_agents", "antifungal_agents",
                                                   "anti_ulcer_agents","immunosuppressive_agents", "diuretics", "platelet_aggregation_inhibitors",
                                                   "sedation","neuromuscular_blocking_agents", "bronchodilator_agents", "hiv_medication", "mv", "icu",
                                                   "death") 


mca.coda19.categorical <- MCA(coda19_categorical2,
                  quali.sup = c(17), # Removing death from observation but MV kept
                  graph = FALSE)

As we can see below, the selected categorical variables are better at describing the variance of the data with cumulative percentage of 62% with 6 PCs.

head(mca.coda19.categorical$eig)
##       eigenvalue percentage of variance cumulative percentage of variance
## dim 1 0.19791808              19.791808                          19.79181
## dim 2 0.12379136              12.379136                          32.17094
## dim 3 0.10032980              10.032980                          42.20392
## dim 4 0.07400790               7.400790                          49.60471
## dim 5 0.06788275               6.788275                          56.39299
## dim 6 0.06330158               6.330158                          62.72315

The MCA does distinguish some patterns. The biplot effectively shows that the selected categorical variables are able to distinguish 2 groups of patients : ward vs icu patients mostly based on the medication.

fviz_mca_biplot(mca.coda19.categorical,
             geom.ind = "point", # show points only (nbut not "text")
             col.ind = coda19_categorical2$icu, # color by death
             palette = c("#00AFBB", "#E7B800", "#FC4E07"),
             addEllipses = TRUE, # Concentration ellipses
             legend.title = "ICU",
             repel = TRUE
)
## Warning: ggrepel: 20 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

### FAMD analysis including both continuous and categorical variables

The ultimate goal was to use all of the variables available continuous and categorical, in order to mine some underlying clinical patterns. To do so, we conducted a FAMD analysis, a component analysis that uses Gower distance to measure the dissimilarity between observations.

# Preparing the dataset to retain only the selected variables
coda19_complete <- coda19_continuous2%>%select(-c("death","icu","mv"))%>%cbind(coda19_categorical2) 

The results are presented below :

famd.coda19 <- FAMD(coda19_complete,
                    sup.var = c(33, 34), # Removing icu and death from observation but MV kept
                    graph = FALSE)


print(famd.coda19$eig)
##        eigenvalue percentage of variance cumulative percentage of variance
## comp 1   3.251239              10.160121                          10.16012
## comp 2   2.412674               7.539605                          17.69973
## comp 3   2.227780               6.961813                          24.66154
## comp 4   1.894348               5.919836                          30.58138
## comp 5   1.650218               5.156930                          35.73831
fviz_famd_ind(famd.coda19,
              axes = c(1,2),
              geom.ind = "point",
              repel = TRUE,
              label = "none", 
              col.ind = as.factor(coda19_complete$icu), 
              palette = c("#0073C2FF", "#EFC000FF", "#868686FF", "#4A6990FF"),
              addEllipses = TRUE, 
              legend.title = "ICU")
## Warning in if (col.ind %in% c("cos2", "contrib", "coord")) partial = NULL: the
## condition has length > 1 and only the first element will be used

fviz_famd_ind(famd.coda19,
              axes = c(1,2),
              geom.ind = "point",
              repel = TRUE,
              label = "none", 
              col.ind = as.factor(coda19_complete$death), 
              palette = c("#0073C2FF", "#EFC000FF", "#868686FF", "#4A6990FF"),
              addEllipses = TRUE, 
              legend.title = "DEATH")
## Warning in if (col.ind %in% c("cos2", "contrib", "coord")) partial = NULL: the
## condition has length > 1 and only the first element will be used

Visually, the FAMD is not linearly able to separate the data into clusters of clinical outcomes : icu admission or death. Bu several aspects mut consider here : 1) the data is not linearly separable, therefore FAMD might simply not be adequate to conduct the analysis in the first place 2) There are other clusters not related to clinical outcomes that can be identified.

The next step of the project was to conduct cluster analysis and identifying clinical phenotypes based on the selected variables conducting PCA and CMA.

4 Clustering

The objective of clustering is to identify patterns and clusters within the data. It is a well documented unsupervised machine learning method used in multiple fields of molecular biology.

In our case, we leveraged this technique to identify COVID 19 phenotype using the 31 selected variables. More precisely, we used PAM clustering, a method that analogous to K-Means clustering but that uses medoids instead of centroids and a method that is compatible with mixed data types as in our case.

4.1 Determining the optimal number of clusters

# Calculating Gower Distance
coda19_gower <- cluster::daisy(coda19_complete%>%select(-c("death", "icu")), metric = "gower") #Removing death and icu admission from the clustering

# Silhouette analysis for values of K between 2 and 10
set.seed(123)

sil_width <- c(NA)
for(i in 2:10){
  pam_fit <- pam(coda19_gower,
                 diss = TRUE,
                 k = i)
  sil_width[i] <- pam_fit$silinfo$avg.width
  
}

plot(1:10, sil_width,
     xlab = "Number of clusters",
     ylab = "Silhouette Width")
lines(1:10, sil_width)

We chose k=3 as the appropriate numbers of clusters for our dataset as it yielded the highest silhouette width.

pam_fit <- pam(coda19_gower, diss = TRUE, 3)
pam_results <- coda19_complete %>%
  mutate(cluster = pam_fit$clustering) %>%
  group_by(cluster) %>%
  do(the_summary = summary(.))

clusters <- factor(pam_fit$clustering)

4.2 Clusters analysis

Once the PAM clustering was conducted, we assessed its validity through :

  • Visualization
  • Silhouette analysis
  • Descriptive statistics

4.2.1 Visual representation

set.seed(10)

tsne_obj <- Rtsne(coda19_gower, is_distance = TRUE)
tsne_data <- tsne_obj$Y %>%
  data.frame() %>%
  setNames(c("X", "Y")) %>%
  mutate(cluster = as.factor(pam_fit$clustering))

p <- ggplot(aes(x = X, y = Y,), data = tsne_data) +
    geom_point(aes(color = cluster))+
    ggtitle("3 K PAM clustering")

show(p)

results <- coda19_complete%>%mutate(cluster = pam_fit$clustering)

The following patients represent the medoids :

coda19_complete[pam_fit$medoids, ]
##     patient_age hemoglobin_min plt_max wbc_max neutrophil_max lymph_min
## 31           63            128     228     7.5           6.13      0.85
## 295          64             94     196     3.0           1.86      0.89
## 179          71            123     310     5.6           4.87      0.34
##     mono_max eos_min sodium_min creatinine_max sbp_max sbp_min dbp_mean
## 31      0.60    0.00        138             49     128     108    67.75
## 295     0.18    0.03        139             54     147     101    72.13
## 179     0.57    0.09        139             55     152     103    71.29
##     temp_max temp_min so2_min rr_mean female male vasopressors glucocorticoids
## 31      38.1     36.4      88   20.00      0    1            0               0
## 295     37.8     36.3      86   19.69      0    1            0               0
## 179     37.0     36.1      91   19.60      1    0            0               0
##     anti_bacterial_agents antifungal_agents anti_ulcer_agents
## 31                      1                 0                 0
## 295                     1                 0                 1
## 179                     1                 0                 0
##     immunosuppressive_agents diuretics platelet_aggregation_inhibitors sedation
## 31                         0         0                               0        0
## 295                        0         0                               0        0
## 179                        0         0                               0        0
##     neuromuscular_blocking_agents bronchodilator_agents hiv_medication mv icu
## 31                              0                     0              0  0   0
## 295                             0                     0              0  0   0
## 179                             0                     0              0  0   0
##     death
## 31      0
## 295     0
## 179     0

4.2.2 Descriptive statistics

A pdf file with the distribution of variables accross clusters has been provided.

# This chunk codes exports descriptive statistics according to clusters
autoEDA_results_k <- autoEDA(results, 
                           y = "cluster", returnPlotList = TRUE,
                           outcomeType = "automatic", removeConstant = TRUE, 
                           removeZeroSpread = TRUE, removeMajorityMissing = TRUE, 
                           imputeMissing = TRUE, clipOutliers = FALSE, 
                           minLevelPercentage = 0.025, predictivePower = TRUE, 
                           outlierMethod = "tukey", lowPercentile = 0.01, 
                           upPercentile = 0.99, plotCategorical = "stackedBar", 
                           plotContinuous = "boxplot", bins = 30, 
                           transparency = 0.7,
                           rotateLabels = TRUE, color = "#26A69A", 
                           verbose = FALSE) 

p <- plot_grid(plotlist = autoEDA_results_k$plots, ncol = 3)
ggsave("cluster_analysis_3k_v2.svg", width=20,height=84, limitsize = FALSE)

Below a table one of the different variables across the clusters.

tableone::kableone(tableOne_compared)%>%
  kable_classic(full_width = F, html_font = "Computer Modern")
1 2 3 p test
n 103 106 183
patient_age (mean (SD)) 67.02 (18.09) 70.08 (15.65) 69.95 (20.69) 0.385
hemoglobin_min (mean (SD)) 120.40 (17.93) 112.25 (21.17) 109.90 (19.54) <0.001
plt_max (mean (SD)) 237.87 (108.41) 218.57 (101.98) 267.91 (124.63) 0.001
wbc_max (mean (SD)) 7.90 (6.05) 10.97 (23.93) 8.93 (4.91) 0.231
neutrophil_max (mean (SD)) 6.02 (4.79) 6.45 (3.98) 6.91 (4.45) 0.257
lymph_min (mean (SD)) 1.04 (1.66) 0.83 (0.57) 1.05 (0.70) 0.176
mono_max (mean (SD)) 0.66 (0.35) 0.74 (0.96) 0.70 (0.44) 0.635
eos_min (mean (SD)) 0.03 (0.06) 0.05 (0.10) 0.06 (0.17) 0.251
sodium_min (mean (SD)) 128.46 (29.81) 130.53 (25.92) 132.54 (24.95) 0.452
creatinine_max (mean (SD)) 133.83 (153.66) 149.09 (174.37) 118.26 (152.60) 0.277
sbp_max (mean (SD)) 144.94 (19.76) 151.37 (19.63) 148.93 (23.33) 0.092
sbp_min (mean (SD)) 110.94 (14.59) 112.31 (16.88) 110.80 (17.45) 0.740
dbp_mean (mean (SD)) 73.73 (11.32) 71.92 (7.74) 70.79 (8.57) 0.035
temp_max (mean (SD)) 37.72 (1.01) 37.83 (0.91) 37.64 (1.01) 0.320
temp_min (mean (SD)) 36.18 (0.64) 36.07 (0.62) 36.15 (0.61) 0.369
so2_min (mean (SD)) 87.42 (15.87) 88.42 (10.52) 90.37 (7.60) 0.076
rr_mean (mean (SD)) 20.73 (2.58) 21.05 (3.57) 20.40 (2.53) 0.174
female = 1 (%) 0 ( 0.0) 0 ( 0.0) 183 (100.0) <0.001
male = 1 (%) 103 (100.0) 106 (100.0) 0 ( 0.0) <0.001
vasopressors = 1 (%) 13 ( 12.6) 30 ( 28.3) 26 ( 14.2) 0.003
glucocorticoids = 1 (%) 7 ( 6.8) 20 ( 18.9) 17 ( 9.3) 0.012
anti_bacterial_agents = 1 (%) 65 ( 63.1) 80 ( 75.5) 103 ( 56.3) 0.005
antifungal_agents = 1 (%) 0 ( 0.0) 1 ( 0.9) 1 ( 0.5) 0.630
anti_ulcer_agents = 1 (%) 0 ( 0.0) 106 (100.0) 85 ( 46.4) <0.001
immunosuppressive_agents = 1 (%) 0 ( 0.0) 5 ( 4.7) 4 ( 2.2) 0.074
diuretics = 1 (%) 13 ( 12.6) 28 ( 26.4) 44 ( 24.0) 0.031
platelet_aggregation_inhibitors = 1 (%) 15 ( 14.6) 42 ( 39.6) 37 ( 20.2) <0.001
sedation = 1 (%) 1 ( 1.0) 23 ( 21.7) 13 ( 7.1) <0.001
neuromuscular_blocking_agents = 1 (%) 0 ( 0.0) 1 ( 0.9) 0 ( 0.0) 0.259
bronchodilator_agents = 1 (%) 25 ( 24.3) 27 ( 25.5) 42 ( 23.0) 0.887
hiv_medication = 1 (%) 2 ( 1.9) 1 ( 0.9) 2 ( 1.1) 0.777
mv = 1 (%) 0 ( 0.0) 8 ( 7.5) 4 ( 2.2) 0.004
icu = 1 (%) 14 ( 13.6) 40 ( 37.7) 24 ( 13.1) <0.001
death = 1 (%) 19 ( 18.4) 34 ( 32.1) 41 ( 22.4) 0.055
cluster (mean (SD)) 1.00 (0.00) 2.00 (0.00) 3.00 (0.00) <0.001

5 Experiments

In the following section, we experimented with different iteration of the dataset to observe its impact on clustering.

5.1 Data as above without gender

# Calculating Gower Distance
coda19_gower_nogender <- cluster::daisy(coda19_complete%>%select(-c('male','female', 'icu', 'death')) , metric = "gower")
# Removing gender and outcome variables from the clustering 

# Clustering PAM
pam_fit_nogender <- pam(coda19_gower_nogender, diss = TRUE, 3)

# Plotting on TSNE
tsne_obj <- Rtsne(coda19_gower_nogender, is_distance = TRUE)
tsne_data <- tsne_obj$Y %>%
  data.frame() %>%
  setNames(c("X", "Y")) %>%
  mutate(cluster = as.factor(pam_fit_nogender$clustering))

p <- ggplot(aes(x = X, y = Y,), data = tsne_data) +
    geom_point(aes(color = cluster))+
    ggtitle("3 K PAM clustering")

show(p)

results_nogender <- coda19_complete%>%mutate(cluster = pam_fit_nogender$clustering)

alllist <- names(results_nogender)
catlist <- names(results_nogender[sapply(results_nogender, is.factor)])
tableOne_compared <- tableone::CreateTableOne(vars = alllist, data = results_nogender, strata = "cluster", factorVars = catlist, test=TRUE)

tableone::kableone(tableOne_compared)%>%
  kable_classic(full_width = F, html_font = "Computer Modern")
1 2 3 p test
n 189 143 60
patient_age (mean (SD)) 65.65 (18.18) 73.29 (20.27) 70.72 (14.39) 0.001
hemoglobin_min (mean (SD)) 115.26 (19.47) 115.24 (19.17) 102.45 (20.56) <0.001
plt_max (mean (SD)) 231.26 (102.94) 250.61 (108.09) 285.83 (159.52) 0.006
wbc_max (mean (SD)) 9.75 (18.39) 7.70 (4.54) 11.10 (5.37) 0.184
neutrophil_max (mean (SD)) 6.51 (4.43) 5.68 (4.15) 8.75 (4.35) <0.001
lymph_min (mean (SD)) 0.95 (1.26) 1.19 (0.74) 0.64 (0.55) 0.001
mono_max (mean (SD)) 0.70 (0.74) 0.66 (0.32) 0.84 (0.64) 0.137
eos_min (mean (SD)) 0.03 (0.07) 0.07 (0.14) 0.07 (0.22) 0.022
sodium_min (mean (SD)) 129.98 (27.01) 132.62 (25.64) 129.87 (27.46) 0.634
creatinine_max (mean (SD)) 121.49 (140.16) 106.24 (121.80) 217.92 (245.17) <0.001
sbp_max (mean (SD)) 145.33 (20.59) 148.20 (21.22) 159.47 (22.09) <0.001
sbp_min (mean (SD)) 108.33 (13.93) 114.31 (16.66) 113.13 (21.85) 0.003
dbp_mean (mean (SD)) 70.87 (8.80) 73.05 (9.96) 72.20 (8.48) 0.098
temp_max (mean (SD)) 37.99 (0.96) 37.22 (0.87) 38.02 (0.89) <0.001
temp_min (mean (SD)) 36.27 (0.59) 36.07 (0.64) 35.89 (0.59) <0.001
so2_min (mean (SD)) 88.03 (12.46) 92.34 (4.65) 84.53 (15.00) <0.001
rr_mean (mean (SD)) 21.06 (3.06) 19.91 (1.93) 21.23 (3.65) <0.001
female = 1 (%) 75 (39.7) 80 (55.9) 28 ( 46.7) 0.013
male = 1 (%) 114 (60.3) 63 (44.1) 32 ( 53.3) 0.013
vasopressors = 1 (%) 14 ( 7.4) 9 ( 6.3) 46 ( 76.7) <0.001
glucocorticoids = 1 (%) 24 (12.7) 9 ( 6.3) 11 ( 18.3) 0.031
anti_bacterial_agents = 1 (%) 188 (99.5) 0 ( 0.0) 60 (100.0) <0.001
antifungal_agents = 1 (%) 1 ( 0.5) 0 ( 0.0) 1 ( 1.7) 0.314
anti_ulcer_agents = 1 (%) 78 (41.3) 56 (39.2) 57 ( 95.0) <0.001
immunosuppressive_agents = 1 (%) 6 ( 3.2) 0 ( 0.0) 3 ( 5.0) 0.051
diuretics = 1 (%) 21 (11.1) 20 (14.0) 44 ( 73.3) <0.001
platelet_aggregation_inhibitors = 1 (%) 40 (21.2) 32 (22.4) 22 ( 36.7) 0.042
sedation = 1 (%) 0 ( 0.0) 3 ( 2.1) 34 ( 56.7) <0.001
neuromuscular_blocking_agents = 1 (%) 0 ( 0.0) 0 ( 0.0) 1 ( 1.7) 0.062
bronchodilator_agents = 1 (%) 49 (25.9) 27 (18.9) 18 ( 30.0) 0.163
hiv_medication = 1 (%) 2 ( 1.1) 1 ( 0.7) 2 ( 3.3) 0.291
mv = 1 (%) 0 ( 0.0) 0 ( 0.0) 12 ( 20.0) <0.001
icu = 1 (%) 26 (13.8) 10 ( 7.0) 42 ( 70.0) <0.001
death = 1 (%) 46 (24.3) 24 (16.8) 24 ( 40.0) 0.002
cluster (mean (SD)) 1.00 (0.00) 2.00 (0.00) 3.00 (0.00) <0.001

5.2 ICU patients at 72h n = 78, 44 variables

# Loading the ICU dataset

# covidicu72h_imputed <- read_csv (./"covidicu72h_imputed_filtered.csv)

# Converting appropriate columns to factor
covidicu72h_imputed <- covidicu72h_imputed%>%mutate_if(function(x){(length(unique(x)) <= 3)}, function(x){as.factor(x)})

# Calculating Gower Distance
coda19_gower_icu <- cluster::daisy(covidicu72h_imputed%>%select(-c('male','female','death')) , metric = "gower")

# Removing gender and outcome variables from the clustering 


# Optimal K
sil_width <- c(NA)
for(i in 2:10){
  pam_fit <- pam(coda19_gower_icu,
                 diss = TRUE,
                 k = i)
  sil_width[i] <- pam_fit$silinfo$avg.width
  
}

plot(1:10, sil_width,
     xlab = "Number of clusters",
     ylab = "Silhouette Width")
lines(1:10, sil_width)

# Clustering PAM
pam_fit_icu <- pam(coda19_gower_icu, diss = TRUE, 3)

set.seed(25)
# Plotting on TSNE
tsne_obj <- Rtsne(coda19_gower_icu, is_distance = TRUE, perplexity = 15)
tsne_data <- tsne_obj$Y %>%
  data.frame() %>%
  setNames(c("X", "Y")) %>%
  mutate(cluster = as.factor(pam_fit_icu$clustering))

p <- ggplot(aes(x = X, y = Y,), data = tsne_data) +
    geom_point(aes(color = cluster))+
    ggtitle("3 K PAM clustering")

show(p)

results_icu <- covidicu72h_imputed%>%mutate(cluster = pam_fit_icu$clustering)

alllist <- names(results_icu)
catlist <- names(results_icu[sapply(results_icu, is.factor)])
tableOne_compared <- tableone::CreateTableOne(vars = alllist, data = results_icu, strata = "cluster", factorVars = catlist, test=TRUE)

tableone::kableone(tableOne_compared)%>%
  kable_classic(full_width = F, html_font = "Computer Modern")
1 2 3 p test
n 27 17 34
patient_age (mean (SD)) 63.22 (14.15) 55.24 (15.81) 66.41 (10.54) 0.020
hemoglobin_min (mean (SD)) 115.81 (15.80) 99.71 (21.94) 104.71 (19.78) 0.016
plt_max (mean (SD)) 265.74 (122.72) 199.00 (96.95) 304.18 (116.24) 0.011
wbc_max (mean (SD)) 9.03 (5.37) 9.94 (5.14) 11.64 (4.90) 0.138
neutrophil_max (mean (SD)) 7.24 (5.36) 7.69 (4.11) 9.14 (3.41) 0.209
lymph_min (mean (SD)) 0.75 (0.35) 0.47 (0.21) 0.60 (0.43) 0.041
mono_max (mean (SD)) 0.65 (0.41) 0.71 (0.57) 0.75 (0.52) 0.769
eos_min (mean (SD)) 0.02 (0.05) 0.01 (0.01) 0.06 (0.11) 0.050
sodium_min (mean (SD)) 123.37 (37.06) 135.76 (2.68) 125.44 (35.99) 0.439
chloride_min (mean (SD)) 94.59 (20.09) 101.41 (3.20) 92.06 (24.60) 0.301
albumin_min (mean (SD)) 30.22 (4.47) 27.76 (5.25) 26.47 (3.75) 0.005
bicarbonate_min (mean (SD)) 23.27 (2.93) 21.04 (4.35) 19.96 (3.67) 0.003
bun_max (mean (SD)) 8.48 (5.76) 11.22 (9.00) 14.59 (9.17) 0.017
glucose_max (mean (SD)) 8.83 (4.32) 9.98 (3.56) 13.60 (4.61) <0.001
anion_gap_mean (mean (SD)) 10.21 (2.39) 8.98 (1.73) 11.00 (2.78) 0.025
ptt_max (mean (SD)) 28.85 (7.50) 42.35 (30.16) 49.00 (30.00) 0.009
alt_max (mean (SD)) 62.93 (51.74) 37.76 (25.11) 45.09 (50.39) 0.172
ast_max (mean (SD)) 78.89 (65.84) 56.88 (30.61) 64.26 (50.61) 0.366
bili_tot_max (mean (SD)) 11.70 (7.62) 11.47 (7.38) 12.76 (7.55) 0.796
tropot_max (mean (SD)) 46.44 (75.46) 91.41 (122.86) 237.56 (742.08) 0.303
lipase_max (mean (SD)) 61.59 (59.71) 131.35 (231.82) 56.62 (42.26) 0.080
ck_max (mean (SD)) 808.74 (1444.08) 535.65 (660.21) 1064.09 (1555.87) 0.423
weight_mean (mean (SD)) 75.89 (14.90) 80.48 (9.89) 86.74 (18.58) 0.032
fio2_max (mean (SD)) 54.78 (29.31) 68.18 (27.21) 88.68 (17.72) <0.001
lactate_max (mean (SD)) 1.73 (1.41) 1.63 (1.60) 1.96 (1.00) 0.647
svo2sat_min (mean (SD)) 50.33 (22.84) 62.12 (18.38) 60.29 (21.02) 0.113
temp_max (mean (SD)) 38.41 (0.86) 38.49 (0.88) 37.86 (0.61) 0.005
female = 1 (%) 7 (25.9) 4 (23.5) 13 (38.2) 0.448
male = 1 (%) 20 (74.1) 13 (76.5) 21 (61.8) 0.448
vasopressors = 1 (%) 4 (14.8) 14 (82.4) 30 (88.2) <0.001
glucocorticoids = 1 (%) 2 ( 7.4) 2 (11.8) 8 (23.5) 0.200
anti_bacterial_agents = 1 (%) 21 (77.8) 15 (88.2) 32 (94.1) 0.164
antifungal_agents = 1 (%) 1 ( 3.7) 0 ( 0.0) 0 ( 0.0) 0.384
immunosuppressive_agents = 1 (%) 1 ( 3.7) 0 ( 0.0) 3 ( 8.8) 0.370
diuretics = 1 (%) 2 ( 7.4) 1 ( 5.9) 29 (85.3) <0.001
platelet_aggregation_inhibitors = 1 (%) 7 (25.9) 4 (23.5) 10 (29.4) 0.896
sedation = 1 (%) 0 ( 0.0) 11 (64.7) 26 (76.5) <0.001
neuromuscular_blocking_agents = 1 (%) 0 ( 0.0) 0 ( 0.0) 1 ( 2.9) 0.519
bronchodilator_agents = 1 (%) 5 (18.5) 2 (11.8) 6 (17.6) 0.825
hiv_medication = 1 (%) 1 ( 3.7) 0 ( 0.0) 1 ( 2.9) 0.738
mv = 1 (%) 0 ( 0.0) 2 (11.8) 10 (29.4) 0.006
death = 1 (%) 1 ( 3.7) 6 (35.3) 18 (52.9) <0.001
cluster (mean (SD)) 1.00 (0.00) 2.00 (0.00) 3.00 (0.00) <0.001

5.2.1 ICU data using continuous variables only

covidicu72h_imputed_cont <- covidicu72h_imputed%>%select(c("patient_age", "hemoglobin_min", "plt_max", "wbc_max", "neutrophil_max", 
                                            "lymph_min", "mono_max", "eos_min", "sodium_min", "chloride_min", "albumin_min", "bicarbonate_min",
                                            "bun_max", "glucose_max", "anion_gap_mean", "ptt_max", "alt_max", "ast_max", "bili_tot_max",
                                            "tropot_max",
                                            "lipase_max", "ck_max", "weight_mean", "fio2_max","lactate_max", "svo2sat_min", "temp_max"
                                            ,"female", "male", "mv", "death"))


pca.coda19 <- PCA(covidicu72h_imputed_cont,
                  quali.sup = c(28,29, 30,31), # cat variables from the dataset not considered : death, mv, female/male
                  graph = FALSE)
coda19.kmeans <- kmeans(covidicu72h_imputed_cont[, 1:28], centers = 3, nstart = 60)

coda19.obs.clusters <- as.factor(coda19.kmeans$cluster)

# Color variables by groups
fviz_pca_ind(pca.coda19, 
             geom.ind = "point",
             col.ind = coda19.obs.clusters, 
             palette = c("#0073C2FF", "#EFC000FF", "#868686FF"),
             addEllipses = FALSE, 
             legend.title = "Cluster")

results_icu <- covidicu72h_imputed_cont%>%mutate(cluster = coda19.obs.clusters)

alllist <- names(results_icu)
catlist <- names(results_icu[sapply(results_icu, is.factor)])
tableOne_compared <- tableone::CreateTableOne(vars = alllist, data = results_icu, strata = "cluster", factorVars = catlist, test=TRUE)

tableone::kableone(tableOne_compared)%>%
  kable_classic(full_width = F, html_font = "Computer Modern")
1 2 3 p test
n 1 9 68
patient_age (mean (SD)) 65.00 (NA) 67.33 (11.02) 62.25 (13.99) NA
hemoglobin_min (mean (SD)) 96.00 (NA) 106.33 (21.81) 107.78 (19.82) NA
plt_max (mean (SD)) 354.00 (NA) 271.44 (107.31) 266.22 (122.99) NA
wbc_max (mean (SD)) 12.40 (NA) 14.67 (7.48) 9.77 (4.62) NA
neutrophil_max (mean (SD)) 11.59 (NA) 11.50 (4.43) 7.67 (4.18) NA
lymph_min (mean (SD)) 0.00 (NA) 0.44 (0.25) 0.66 (0.38) NA
mono_max (mean (SD)) 1.36 (NA) 0.95 (0.61) 0.66 (0.47) NA
eos_min (mean (SD)) 0.23 (NA) 0.05 (0.09) 0.03 (0.08) NA
sodium_min (mean (SD)) 141.00 (NA) 124.56 (38.60) 127.09 (31.89) NA
chloride_min (mean (SD)) 103.00 (NA) 89.78 (28.23) 95.54 (19.28) NA
albumin_min (mean (SD)) 22.00 (NA) 25.78 (4.29) 28.44 (4.57) NA
bicarbonate_min (mean (SD)) 28.50 (NA) 21.11 (3.69) 21.27 (3.82) NA
bun_max (mean (SD)) 10.60 (NA) 15.64 (9.53) 11.24 (8.30) NA
glucose_max (mean (SD)) 7.40 (NA) 10.88 (2.88) 11.25 (5.01) NA
anion_gap_mean (mean (SD)) 7.00 (NA) 11.47 (2.84) 10.18 (2.47) NA
ptt_max (mean (SD)) 31.00 (NA) 40.44 (14.45) 40.74 (27.35) NA
alt_max (mean (SD)) 67.00 (NA) 75.67 (59.20) 45.97 (45.06) NA
ast_max (mean (SD)) 127.00 (NA) 128.78 (63.03) 58.76 (46.31) NA
bili_tot_max (mean (SD)) 17.00 (NA) 18.44 (12.90) 11.21 (6.13) NA
tropot_max (mean (SD)) 4340.00 (NA) 216.22 (287.34) 67.63 (82.23) NA
lipase_max (mean (SD)) 84.00 (NA) 101.78 (59.18) 70.90 (124.96) NA
ck_max (mean (SD)) 325.00 (NA) 4340.67 (1194.88) 407.79 (380.58) NA
weight_mean (mean (SD)) 62.00 (NA) 83.17 (19.47) 81.71 (15.97) NA
fio2_max (mean (SD)) 100.00 (NA) 86.78 (27.99) 70.18 (28.13) NA
lactate_max (mean (SD)) 1.60 (NA) 2.27 (1.38) 1.75 (1.28) NA
svo2sat_min (mean (SD)) 88.00 (NA) 57.33 (24.30) 56.78 (21.12) NA
temp_max (mean (SD)) 38.60 (NA) 38.24 (0.49) 38.17 (0.85) NA
female = 1 (%) 0 ( 0.0) 2 ( 22.2) 22 ( 32.4) 0.659
male = 1 (%) 1 (100.0) 7 ( 77.8) 46 ( 67.6) 0.659
mv = 1 (%) 1 (100.0) 5 ( 55.6) 6 ( 8.8) <0.001
death = 1 (%) 0 ( 0.0) 3 ( 33.3) 22 ( 32.4) 0.786
cluster (%) <0.001
1 1 (100.0) 0 ( 0.0) 0 ( 0.0)
2 0 ( 0.0) 9 (100.0) 0 ( 0.0)
3 0 ( 0.0) 0 ( 0.0) 68 (100.0)